Abstract:
Recently, image-text matching based on local region-word semantic alignment has attracted considerable research attention. The fine-grained interplay can be achieved by aggregating the similarities of the region-word pairs. However, the similarities of aligned region-word pairs are treated equally in most of the cross-modal matching literature, without considering their respective importance. Moreover, local alignment methods are prone to global semantic drift because they ignore thematic considerations for the image-text pairs. In this paper, a novel Dual-View Semantic Inference (DVSI) network is proposed to leverage both local and global semantic matching in a holistic deep framework. For the local view, a region enhancement module is proposed to mine the priorities of different regions in the image, which provides differentiated abilities to discover the latent region-word relationships. For the global view, the overall semantics of the image are summarized for global semantic matching to avoid global semantic drift. The two views are unified for final image-text matching. Extensive experiments conducted on MSCOCO and Flickr30K demonstrate the effectiveness of the proposed DVSI. (C) 2020 Elsevier B.V. All rights reserved.
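As a rough illustration of the weighted local aggregation plus global matching described above, the following PyTorch sketch reweights region-word cosine similarities by learned region priorities and mixes the result with a global cosine similarity. The function name, the softmax weighting, and the equal local/global mixing are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (PyTorch) of a dual-view similarity; names and weighting are assumptions.
import torch
import torch.nn.functional as F

def dual_view_score(regions, words, region_scores, img_global, txt_global, alpha=0.5):
    # regions: (R, d) region features, words: (W, d) word features,
    # region_scores: (R,) learned region priorities (the "region enhancement" idea),
    # img_global / txt_global: (d,) summarized global features.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = regions @ words.t()                                  # (R, W) region-word cosine sims
    w = torch.softmax(region_scores, dim=0)                    # region importance weights
    local = (w.unsqueeze(1) * sim).max(dim=0).values.mean()    # weighted local alignment
    glob = F.cosine_similarity(img_global, txt_global, dim=0)  # global-view similarity
    return alpha * local + (1.0 - alpha) * glob                # unify local and global views
```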
Abstract:
Image captioning is a task to generate natural descriptions of images. In existing image captioning models, the generated captions usually lack semantic discriminability. Semantic discriminability is difficult to achieve, as it requires the model to capture detailed differences between images. In this paper, we propose an image captioning framework with semantic-enhanced features and extremely hard negative examples. These two components are combined in a Semantic-Enhanced Module. The semantic-enhanced module consists of an image-text matching sub-network and a Feature Fusion layer, which provides semantic-enhanced features with rich semantic information. Moreover, in order to improve semantic discriminability, we propose an extremely hard negative mining method which utilizes extremely hard negative examples to improve the latent alignment between visual and language information. Experimental results on MSCOCO and Flickr30K show that our proposed framework and training method can simultaneously improve the performance of image-text matching and image captioning, achieving competitive performance against state-of-the-art methods. (C) 2020 Elsevier B.V. All rights reserved.
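For context, hard negative mining in image-text matching is often implemented as a max-violation triplet loss over a batch similarity matrix, as in the sketch below. The paper's "extremely hard" selection strategy may differ; this only shows the standard baseline the idea builds on, with illustrative names.

```python
# Hedged sketch of a hard-negative (max-violation) triplet loss; not the paper's exact method.
import torch

def hard_negative_triplet_loss(sim, margin=0.2):
    # sim: (B, B) image-to-caption similarity matrix; diagonal entries are the positives.
    pos = sim.diag().view(-1, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_cap = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image anchors
    cost_img = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # caption anchors
    # Keep only the hardest (most violating) negative for each anchor.
    return cost_cap.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()
```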
Abstract:
Recently, image-text matching has been intensively explored to bridge vision and language. Previous methods explore an inter-modality relationship between an image-text pair from a single-view feature. However, it is difficult to discover all the abundant information based on a single inter-modality relationship. In this paper, a novel Multi-View Inter-Modality Representation with Progressive Fusion (MIRPF) is developed to explore inter-modality relationships from multi-view features. The multi-view strategy provides more complementary and global semantic clues than single-view approaches. In particular, the multi-view inter-modality representation network is constructed to generate multiple inter-modality representations, which provide diverse views to discover the latent image-text relationships. Furthermore, the progressive fusion module fuses inter-modality features stepwise, which fully exploits the inherent complementarity between different views. Extensive experiments on Flickr30K and MSCOCO verify the superiority of MIRPF compared with several existing approaches. The code is available at: https://github.com/jasscia18/MIRPF. (C) 2023 Published by Elsevier B.V.
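One way to read "progressive fusion" is a stepwise, gated combination of the per-view representations. The module below is a minimal PyTorch sketch of that reading; the gating design is an assumption, not the architecture actually used by MIRPF.

```python
# Illustrative stepwise (progressive) fusion of multi-view features; design is assumed.
import torch
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, views):                     # views: list of (B, dim) tensors
        fused = views[0]
        for v in views[1:]:
            g = self.gate(torch.cat([fused, v], dim=-1))
            fused = g * fused + (1 - g) * v       # fuse the next view step by step
        return fused
```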
Abstract:
Since locally controllable text-to-image generation cannot achieve satisfactory results in detail, a novel locally controllable text-to-image generation network based on visual-linguistic relation alignment is proposed. The goal of the method is to accomplish image processing and generation semantically through text guidance. The proposed method explores the relationship between text and image to achieve local control of text-to-image generation. Visual-linguistic matching learns similarity weights between image and text through semantic features to achieve fine-grained correspondence between local image regions and words. An instance-level optimization function is introduced into the generation process to accurately control the weights with low similarity and combine them with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and the local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method, which enables more accurate control over the original image.
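The similarity weights between words and image regions mentioned here could be computed as a normalized attention over region-word cosine similarities, along the lines of the sketch below. The temperature value and clamping are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of word-to-region similarity weights; hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def word_region_weights(words, regions, temperature=9.0):
    # words: (W, d), regions: (R, d); returns (W, R) weights indicating how strongly
    # each word corresponds to each image region (fine-grained correspondence).
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    sim = (words @ regions.t()).clamp(min=0)       # keep only positive evidence
    return torch.softmax(temperature * sim, dim=-1)
```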
Abstract:
Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-text matching. Generally, according to whether there exist interactions during the retrieval process, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. The independent representation matching methods generate the embeddings of images and sentences independently and are thus convenient for retrieval with hand-crafted matching measures (e.g., cosine or Euclidean distance). As for the cross-interaction matching methods, they achieve improvement by introducing interaction-based networks for inter-relation reasoning, yet suffer from low retrieval efficiency. This article aims to develop a method that takes advantage of the cross-modal inter-relation reasoning of cross-interaction methods while being as efficient as the independent methods. To this end, we propose a graph-based Cross-modal Graph Matching Network (CGMN), which explores both intra- and inter-relations without introducing network interaction. In CGMN, graphs are used for both visual and textual representation to achieve intra-relation reasoning across regions and words, respectively. Furthermore, we propose a novel graph node matching loss to learn fine-grained cross-modal correspondence and to achieve inter-relation reasoning. Experiments on the benchmark datasets MS-COCO, Flickr8K, and Flickr30K show that CGMN outperforms state-of-the-art methods in image retrieval. Moreover, CGMN is much more efficient than state-of-the-art methods using interactive matching. The code is available at https://github.com/cyh-sj/CGMN.
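The efficiency argument for independent-representation methods is that each item is embedded once, so retrieval reduces to similarity lookups over precomputed vectors. The sketch below illustrates that retrieval step only; the names are illustrative and it is not CGMN's actual code.

```python
# Minimal sketch of retrieval with independently precomputed embeddings.
import torch
import torch.nn.functional as F

def retrieve(text_emb, image_embs, k=5):
    # text_emb: (d,) query embedding; image_embs: (N, d) precomputed image embeddings.
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ text_emb        # (N,) cosine similarities
    return scores.topk(k).indices         # indices of the top-k matching images
```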
Abstract:
Image-text matching is a crucial aspect of multi-modal intelligence. The main challenge in this area is accurately measuring the relevance between the image and text, using evidence obtained through matching. Previous studies either concentrated on obtaining a well-represented global feature to measure similarity directly or on investigating complex matching patterns at a local level before aggregating them, with little attention paid to combining the two. We propose a Globally Guided Confidence Enhancement Network that combines both approaches by obtaining a good global representation to guide fine-grained local interactions. In this process, content that better matches the text from a global perspective is enhanced and represented with confidence scores. Extensive experiments demonstrate that the proposed approach achieves superior performance on the Flickr30K and MSCOCO datasets.
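One plausible reading of the globally guided confidence mechanism is to score each region by its agreement with a global text representation and reweight the regions before local matching. The sketch below illustrates only that reading; the names and the softmax scoring are assumptions.

```python
# Hedged sketch: global-text-guided confidence scores that enhance matching regions.
import torch
import torch.nn.functional as F

def confidence_weighted_regions(regions, txt_global):
    # regions: (R, d) local region features, txt_global: (d,) global text feature.
    conf = torch.softmax(
        F.normalize(regions, dim=-1) @ F.normalize(txt_global, dim=-1), dim=0
    )                                               # (R,) confidence scores
    return conf.unsqueeze(1) * regions, conf        # enhanced regions and their scores
```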
Abstract:
Image-text matching aims to find the relationship between image and text data and to establish a connection between them. The main challenge of image-text matching is the fact that images and texts have different data distributions and feature representations. Current methods for image-text matching fall into two basic types: methods that map image and text data into a common space and then use distance measurements, and methods that treat image-text matching as a classification problem. In both cases, the two modalities used are image and text data. In our method, we create a fusion layer to extract an intermediate modality, thus improving the image-text processing results. We also propose a concise way to update the loss function that makes it easier for neural networks to handle difficult problems. The proposed method was verified on the Flickr30K and MS-COCO datasets and achieved superior matching results compared to existing methods. (C) 2021 Elsevier B.V. All rights reserved.
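A fusion layer that derives an intermediate representation from the two modalities could look like the minimal PyTorch module below. The concatenation-plus-MLP design is an assumption made for illustration, not the layer described in the paper.

```python
# Illustrative fusion layer producing an intermediate-modality representation; design is assumed.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):    # both (B, dim)
        # The fused feature acts as an intermediate mode that can be matched in a common space.
        return self.proj(torch.cat([img_emb, txt_emb], dim=-1))
```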
Abstract:
In some scripts, especially the Farsi/Arabic script, letters normally attach together and produce many different patterns, some of which are fully or partially similar. Detecting such patterns and exploiting them to reduce the library size has a considerable effect on the compression ratio.
In this paper, a lossy/lossless compression method is proposed for bi-level printed text images in archiving applications. To this end, we propose a new 1-D pattern matching technique in the chain coding domain that detects repetitive sub-signals in order to identify fully or partially similar patterns.
Experimental results show that the compression performance of the proposed method is considerably better than that of existing bi-level printed text image compression methods, by as much as 1.8-4.2 times in the lossy case and 1.6-3.8 times in the lossless case at 300 dpi.
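For intuition about matching in the chain-coding domain, the small Python sketch below finds sub-signals of a 1-D chain code that occur more than once, which is the kind of redundancy a pattern library can exploit. The brute-force search and the length threshold are illustrative only and do not reflect the paper's actual algorithm.

```python
# Toy illustration: find repeated sub-signals in a 1-D chain code (not the paper's method).
def repeated_subsignals(chain_code, min_len=4):
    # chain_code: string of direction symbols (e.g. "01234567...") describing a contour.
    seen, repeats = {}, set()
    n = len(chain_code)
    for length in range(min_len, n // 2 + 1):
        for i in range(n - length + 1):
            sub = chain_code[i:i + length]
            if sub in seen and seen[sub] != i:
                repeats.add(sub)          # sub-signal occurs at least twice
            else:
                seen.setdefault(sub, i)
    return repeats
```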
Abstract:
(C) 2023 Elsevier Ltd. Image-text matching has become a research hotspot in recent years. The key point of image-text matching is to accurately measure the similarity between an image and a sentence. However, most existing methods either focus on the inter-modality similarities between regions in the image and words in the text or on the intra-modality similarities within image regions or words, so they cannot fully exploit detailed correlations between images and texts. Furthermore, existing methods typically train their models using a triplet ranking loss, which relies on the similarity of randomly sampled triplets. Since the weights of positive and negative samples are not adjusted, it cannot provide enough gradient information for training, resulting in slow convergence and limited performance. To address the above problems, we propose an image-text matching method named Bi-Attention Enhanced Representation Learning (BAERL). It builds a self-attention learning sub-network to exploit intra-modality correlations within image regions or words and a co-attention learning sub-network to exploit inter-modality correlations between image regions and words. Then, the representations obtained by the two sub-networks capture holistic correlations between images and texts. In addition, BAERL uses a self-similarity polynomial loss instead of the triplet ranking loss to train the model. The self-similarity polynomial loss can adaptively assign appropriate weights to different pairs based on their similarity scores so as to further improve retrieval performance. Experiments on two benchmark datasets demonstrate the superior performance of the proposed BAERL method over several state-of-the-art methods.
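The co-attention step mentioned here, in which regions and words build context-aware representations of each other, is commonly implemented along the lines of the hedged sketch below; BAERL's exact sub-network is likely more elaborate, and all names are illustrative.

```python
# Hedged sketch of a basic region-word co-attention step; not BAERL's actual sub-network.
import torch
import torch.nn.functional as F

def co_attention(regions, words):
    # regions: (R, d), words: (W, d). Each region is re-expressed as a weighted sum of
    # words, and each word as a weighted sum of regions, so the modalities inform each other.
    affinity = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()  # (R, W)
    region_ctx = torch.softmax(affinity, dim=1) @ words        # (R, d) text-aware regions
    word_ctx = torch.softmax(affinity.t(), dim=1) @ regions    # (W, d) image-aware words
    return region_ctx, word_ctx
```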
Abstract:
Multi-modal machine translation (MMT) aims to use information from other modalities to assist text machine translation and obtain higher-quality translation results. Many studies have proved that image information can improve the quality of text machine translation. However, the multi-modal corpora used in the translation process require a lot of manual annotation, which makes them difficult to build, and the scarcity of such datasets limits multi-modal machine translation work to a certain extent. To solve the problem of text-image annotation, we propose a text-image similarity matching method. This method encodes the text and images, maps them to a vector space, and uses cosine similarity to retrieve the image most similar to the text in order to construct a multi-modal dataset. We conducted experiments on the Multi30K English-German text-only corpus and the WMT21 English-Hindi text-only corpus, and the experimental results showed that our method improved by 8.4 BLEU over the text-only translation results on the Multi30K corpus. Compared with manually annotated multi-modal datasets, our method improves by 4.2 BLEU. At the same time, it improved by 3.4 BLEU on the low-resource English-Hindi corpus, so our method can effectively support the construction of multi-modal machine translation datasets and, to some extent, advance multi-modal machine translation research.
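The matching step described above, retrieving for each sentence the most similar image by cosine similarity between embeddings, can be sketched as follows. The embedding model is left unspecified here, and the function name and return values are illustrative.

```python
# Minimal sketch of cosine-similarity text-image pairing for corpus construction.
import torch
import torch.nn.functional as F

def best_image_for_sentence(sent_emb, image_embs):
    # sent_emb: (d,) embedding of a source sentence; image_embs: (N, d) embeddings of
    # candidate images. The highest-scoring image is paired with the sentence.
    sims = F.cosine_similarity(image_embs, sent_emb.unsqueeze(0), dim=-1)  # (N,)
    return sims.argmax().item(), sims.max().item()
```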